This article introduces the %TS_HISTORY_CHECK macro, which analyses the influence of the length of the available data history of a time series on the quality of the forecast for future periods.
Additionally, for each time series and each simulation repetition per time series, the length of the time series that delivers the best forecast quality is selected. Based on these results, the distribution of the optimal length of the time history over all time series examples can be examined in more detail.
The rationale for the article and the work on this macro is that analysts usually invest a lot of time in fine-tuning their forecasting models. However, little focus is given to the influence, positive and negative, of the length of the time series. Not all models and model parameters handle the downweighting of "old" data in a sufficient way.
This allows you to gain a better understanding of the influence of the length of your time series on forecast accuracy. There are cases which correspond to the intuition that, on average, a longer time history improves forecast quality. However, there are also cases where the individual optimal length of the time history can be very short (from 1 to 12 months). This is especially true for time series where the underlying patterns change frequently and quickly and, therefore, the recent history is more important than the older history.
The author published an article on medium.com, "Determining the best length of the history of your timeseries data for timeseries forecasting", which shows the business results and business interpretation of this analysis. Find more links in the last section of this article.
Two macros are presented here.
Both macros are applied to two different datasets to illustrate their usage and the interpretation of the findings.
The simulation procedure is as follows:
This procedure is also iterated for different start points of the time series. This means that the above setup is “shifted” in the time series to avoid studying results which only depend on one start point.
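The simulation logic described above can be sketched as follows. This is a minimal, hypothetical Python rendering for illustration only; the macro itself drives SAS forecasting procedures, and `mean_forecast` is just a stand-in model, not something the macro uses:

```python
# Hypothetical sketch of the simulation: for each shift of the start point
# and each candidate history length, fit on the training window and
# measure MAPE on the held-out validation window.

def simulate(series, min_hist, max_hist, shifts, period_valid, forecast_fn):
    """Return a list of (shift, history, mape) tuples.
    Assumes the series is long enough for max_hist plus the largest shift."""
    results = []
    for shift in shifts:
        # shift the end of the usable data back in time
        end = len(series) - shift
        valid = series[end - period_valid:end]          # validation window
        for history in range(min_hist, max_hist + 1):
            train = series[end - period_valid - history:end - period_valid]
            preds = forecast_fn(train, period_valid)
            mape = sum(abs(p - a) / abs(a)
                       for p, a in zip(preds, valid)) / period_valid
            results.append((shift, history, mape))
    return results

def mean_forecast(train, horizon):
    # simplest possible stand-in model: forecast the training mean
    m = sum(train) / len(train)
    return [m] * horizon
```

The real macro replaces `mean_forecast` with PROC HPFENGINE (or PROC ESM) and aggregates the per-series errors with the statistic chosen via the AGGRSTAT parameter.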
An example call of the macro is shown based on the SASHELP.AIR data.
As the macro requires a time series ID variable to identify each time series, this variable has to be provided even if the data contain only one time series.
data air;
set sashelp.air;
tsid = 1;
run;
You can invoke the macro on the SASHELP.AIR data as follows:
%ts_history_check(data=air,tsid=tsid,y=air,
timeid=date,interval=month,
minhist=2,maxhist=48,
shiftfrom=0,shiftto=4,shiftby=1,
periodvalid=12,
mrep=sashelp.hpfdflt,sellist=tsfsselect,
stat=mape,aggrstat=median);
The output shows a line chart. The X-axis represents the respective number of available history months. The Y-axis represents the median MAPE value over all time series and all shift scenarios.
You see that especially in the first 12-18 months, additional history information decreases the forecast error. You also see, however, that the median MAPE slightly increases with a longer time history after about 2 years, indicating that a longer history is not necessarily beneficial for forecasting on this data.
The advantage of the HPFENGINE procedure is that it automatically evaluates a large list of possible time series models (ARIMA, exponential smoothing, ...). If, however, you do not have the HPFENGINE procedure available, you can use the %TS_HISTORY_CHECK_ESM macro, which uses exponential smoothing models.
%ts_history_check_esm(data=air,tsid=tsid,y=air,
timeid=date,interval=month,
minhist=2,maxhist=48,
shiftfrom=0,shiftto=4,shiftby=1,
periodvalid=12,
stat=mape,aggrstat=median);
You receive a similar output as above. However, as only a limited set of forecasting models is used here, the MAPE level is higher than above. You also see that if you only have exponential smoothing models available, a longer time history is still beneficial. Note the steep drop as soon as a full year of data history is available in the data (12, 24, 36, ... months).
As the macro loops over SHIFTS and TIME_HISTORIES, a large number of notes and warnings might be written to the SAS log, which can therefore fill up quickly.
You can use different options to turn the generation of NOTES in the log and the printing of the macro code off and on.
options nonotes nomprint ;
%ts_history_check_ESM(data=timeseries_retail,tsid=tsid,y=sales, ....... );
options notes mprint;
Another option is to route the log content to a file. Here you can use the PRINTTO procedure or specify the ALTLOG system option for your session.
proc printto log = "c:\tmp\logfile.txt";
run;
%TS_HISTORY_CHECK(...);
proc printto log=log;
run;
It can also happen that the macro execution shows an error, e.g. when a certain time series model type is fit with insufficient data. This does not prevent the macro from functioning correctly; however, you will see the error in the log file.
Again you use the %TS_HISTORY_CHECK macro. The dataset TIMESERIES_RETAIL contains weekly retail sales data for 170 weeks. The validation period is set to 13 weeks (one quarter) and the time history is iterated from 2 to 130 weeks.
%ts_history_check(data=timeseries_retail,tsid=tsid,y=sales,
timeid=date,interval=week,
minhist=2,maxhist=130,
shiftfrom=0,shiftto=42,shiftby=7,
periodvalid=13,
mrep=sashelp.hpfdflt,sellist=tsfsselect,
stat=mape,aggrstat=median);
A similar picture can be seen with these data.
The above result is of course not generalizable to other data. However, it shows that it makes sense to study the influence of the length of the data history on the forecast error with your individual data, to get a better view on the importance of a long or short data history.
For completeness, the case where the HPFENGINE procedure is not available is shown as well.
%ts_history_check_ESM(data=timeseries_retail,tsid=tsid,y=sales,
timeid=date,interval=week,
minhist=2,maxhist=130,
shiftfrom=0,shiftto=42,shiftby=7,
periodvalid=13,
stat=mape,aggrstat=median);
The following parameters can be specified with the macro %TS_HISTORY_CHECK:
The following parameters are additionally available with the %ts_history_check_ESM macro:
Note that the SELLIST and the MREP parameter are not available with the %TS_HISTORY_CHECK_ESM macro.
The macro first prepares the input data by creating a surrogate LEAD variable which has negative values for the available history. Time series whose length is shorter than the required length based on the parameters (&MAXHIST, &SHIFTTO, &PERIODVALID) are removed from the analysis.
The macro has an outer loop on SHIFTS and an inner loop on available time histories.
%do shift = &shiftfrom. %to &shiftto. %by &shiftby.;
%do history = &minhist. %to &maxhist.;
In each run, the data are prepared according to the parameters and only the defined subset of observations is provided to the analysis.
data _Hist_check_tmp_input_;
set _Hist_check_tmp_(where = (_count_obs_ >= &minHist.));
_y_valid_ = &y.;
if _lead_*(-1) <= (&periodvalid + &shift.) then &y. = .;
if _lead_*(-1) > (&periodvalid + &shift. + &history) then delete;
run;
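The two IF statements implement the windowing: observations inside the (shifted) validation period keep their row but get a missing target value, and observations older than the requested history are dropped. A hypothetical Python rendering of this per-row rule (with `lead` negative for historic observations, as the macro defines it; the function name is illustrative):

```python
def window_row(lead, y, period_valid, shift, history):
    """Return (keep_row, y_value) for one observation.
    Mask y inside the validation window; drop rows older than the
    requested history."""
    age = -lead  # distance (in periods) from the series end
    if age > period_valid + shift + history:
        return False, None        # older than the allowed history: delete
    if age <= period_valid + shift:
        return True, None         # validation period: keep row, mask y
    return True, y                # training history: keep as-is
```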
PROC HPFENGINE and PROC ESM are used to generate the forecasts in the two macros. They can also be replaced by any other forecasting procedure, such as PROC ARIMA, as long as an output is produced that corresponds to the output table structure of PROC HPFENGINE.
Depending on the available SAS modules, you can run the macro with a SAS/ETS or SAS Econometrics license, where the ESM procedure is used.
proc esm data=_Hist_check_tmp_input_ out=_out_
outfor = _Hist_check_tmp_fc_(drop = lower upper error std)
seasonality=&seasonality lead=&periodvalid;
by &tsid;
id &timeid interval = &interval;
forecast &y/model=&model;
run;
proc sql;
create table _Hist_check_tmp_fc_xt_
as select a.*,b._lead_,b._y_valid_
from _Hist_check_tmp_fc_(drop = _name_) as a
left join _Hist_check_tmp_input_ as b
on a.&tsid. = b.&tsid.
and a.&timeid. = b.&timeid.
order by &tsid., &timeid.;
quit;
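The SQL step performs a left join keyed on the time series ID and the time ID, so each forecast row regains its `_lead_` value and the held-out actual. A hypothetical Python sketch of the same merge (the function and field names are illustrative, not part of the macro):

```python
def join_forecasts(forecasts, inputs):
    """Left join: forecasts is a list of dicts with tsid/timeid/predict;
    inputs maps (tsid, timeid) -> (lead, y_valid). Unmatched rows get None,
    mirroring the missing values a SQL left join would produce."""
    joined = []
    for fc in forecasts:
        key = (fc["tsid"], fc["timeid"])
        lead, y_valid = inputs.get(key, (None, None))
        joined.append({**fc, "lead": lead, "y_valid": y_valid})
    # ORDER BY tsid, timeid
    return sorted(joined, key=lambda r: (r["tsid"], r["timeid"]))
```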
data _ape_;
set _Hist_check_tmp_fc_xt_;
_FC_Period_ = (_lead_*(-1) <= &periodvalid);
_APE_ = abs(predict-_y_valid_)/_y_valid_;
_MS_ = (predict-_y_valid_)**2;
run;
proc means data = _ape_(rename =
(_ape_ = mape _ms_ = _mse_))
noprint nway;
class &tsid _fc_period_;
var mape _mse_;
output out = _mape_(drop=_type_ _freq_
where=(_fc_period_=1)) mean=;
run;
data _mape_;
set _mape_;
rmse = _mse_ ** 0.5;
drop _mse_;
shift=&shift;
History=&history;
drop _fc_period_;
run;
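The MEANS and DATA steps above average the absolute percentage errors and the squared errors per time series, then take the square root of the mean squared error. The same aggregation can be sketched in Python (a hypothetical helper, for illustration only):

```python
import math

def accuracy_by_series(rows):
    """rows: iterable of (tsid, predict, actual) pairs restricted to the
    forecast period. Returns {tsid: (mape, rmse)}, mirroring the per-series
    means computed by PROC MEANS and the RMSE derived afterwards."""
    errors = {}
    for tsid, predict, actual in rows:
        ape = abs(predict - actual) / abs(actual)   # absolute percentage error
        se = (predict - actual) ** 2                # squared error
        errors.setdefault(tsid, []).append((ape, se))
    out = {}
    for tsid, vals in errors.items():
        n = len(vals)
        mape = sum(a for a, _ in vals) / n
        rmse = math.sqrt(sum(s for _, s in vals) / n)
        out[tsid] = (mape, rmse)
    return out
```

Note that, like the data step above, this sketch assumes the actual values are non-zero; zero actuals would need special handling for the MAPE.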
medium.com "Determining the best length of the history of your timeseries data for timeseries forecasting"
There are more SAS Communities articles from the author with a focus on time series analysis:
Webinar on YouTube: https://www.youtube.com/watch?v=-WsYvSasl9w&list=PLdMxv2SumIKsqedLBq0t_a2_6d7jZ6Akq&index=3
This example has been taken from my SAS Press book Data Quality for Analytics Using SAS; see especially chapter 20 (p. 260), chapter 21 (p. 265), and Appendix D.
Download the SAS programs and SAS datasets: https://github.com/gerhard1050/Data-Quality-for-Data-Science-Using-SAS
Further books by the author at SAS Press:
Hi Gerhard,
A good question all forecasters should validate... "How long should our optimal time series be which we feed into our time series forecasting models?" An excellent article, video, and macros that are highly relevant in the sporadic world we are living in and the effect it has on the historical data being used for forecasting.
Cheers,
Michelle